
Data Validation and Quality Control Guide

This guide explains how data quality validation works in the Pipeline Generator Ingestion system and what happens when data passes or fails validation checks.

Overview

The Pipeline Generator Ingestion system ensures data quality by validating all incoming data before it is stored in your destination. This "validate-before-commit" approach prevents bad data from entering your data warehouse or analytics systems.

Key Benefits

Data Quality Guarantee: Only validated data reaches your destination
Early Detection: Problems are caught before they affect downstream systems
Automated Checks: No manual intervention required
Complete Transparency: Full visibility into what passed and what failed
All-or-Nothing: Either all data is valid and committed, or none is written


How Data Validation Works

The pipeline follows a strict validation process with six stages:

┌──────────────────────────────────────┐
│          Your Data Pipeline          │
└──────────────────────────────────────┘

Stage 1: EXTRACT
├─ Data is fetched from your source (API, Database, etc.)
├─ Raw data is stored temporarily
└─ Status: Data collected, not yet validated



Stage 2: PREPARE
├─ Data is processed and prepared for validation
├─ Deduplication rules are applied
├─ Schema is checked for compatibility
└─ Status: Data ready for quality checks



Stage 3: VALIDATE ⚡ (CRITICAL CHECKPOINT)
├─ NULL Checks: Are required fields populated?
├─ Uniqueness Checks: Are primary keys unique?
├─ Freshness Checks: Is the data recent enough?
│
├─ ❌ ANY CHECK FAILS
│   ├─ Data is REJECTED
│   ├─ Pipeline stops immediately
│   ├─ Error details are logged
│   ├─ Nothing is written to destination
│   └─ You receive failure notification
│
└─ ✅ ALL CHECKS PASS
    ├─ Data is approved for storage
    └─ Continue to Stage 4 ↓

Stage 4: COMMIT
├─ Validated data is stored in Delta tables
├─ Ingestion timestamp is added
└─ Status: Data safely stored



Stage 5: DESTINATION WRITE
├─ Data is exported to CSV format (always)
├─ Data is written to your database (if configured)
└─ Status: Data available in destination systems



Stage 6: FINALIZE
├─ Metadata and schemas are stored
├─ Test results summary is created
├─ Temporary files are cleaned up
└─ Status: Pipeline complete ✓

The Critical Validation Checkpoint (Stage 3)

Before validation runs:

  • Data has been extracted from your source
  • Data exists in temporary storage only
  • Nothing has been written to your destination yet

During validation:

  • Multiple quality checks are performed
  • Each check must pass for the pipeline to continue
  • Results are logged in real-time

After validation:

  • If all checks pass: Data proceeds to commit and destination write
  • If any check fails: Pipeline stops, no data is written anywhere
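
To make the all-or-nothing contract concrete, the sketch below shows the control flow in plain Python. The function names and the shape of the check callables are illustrative assumptions, not the system's actual API; it only mirrors the staging, validation, and commit behavior described above.

def run_pipeline(extract, checks, commit):
    """Extract to temporary staging, validate, and commit only if every check passes."""
    staged = extract()  # Stages 1-2: data exists in temporary storage only

    # Stage 3: run every enabled check against the staged batch
    failures = [name for name, check in checks.items() if not check(staged)]

    if failures:
        # Any failure rejects the whole batch; nothing reaches the destination
        raise RuntimeError(f"Validation failed, nothing written: {failures}")

    commit(staged)  # Stages 4-6: all-or-nothing write to the destination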

Data Quality Checks

The system performs three types of automated data quality checks:

1. NULL Checks (Required Fields Validation)

Purpose: Ensures that critical fields are populated and not empty.

What it checks:

  • All primary key columns must have values
  • Required fields specified in your configuration must not be NULL

Example scenarios:

Scenario                                   | Result
All user IDs are populated                 | ✅ PASS
5 out of 100 records missing email address | ❌ FAIL
All required fields have values            | ✅ PASS
Primary key column contains NULL values    | ❌ FAIL

When this check fails:

  • You'll see: "NULL CHECK FAILED: Found X null values in column 'column_name'"
  • Common causes: Missing data at source, incomplete API responses, data extraction errors
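
Conceptually, the NULL check counts missing values in each required column and rejects the whole batch if any are found. The sketch below, written in plain Python over a list of dictionaries, is illustrative only and not the system's implementation:

def null_check(records, required_columns):
    """Fail the batch if any required column contains NULL (None) or missing values."""
    passed = True
    for column in required_columns:
        nulls = sum(1 for row in records if row.get(column) is None)
        if nulls:
            print(f"NULL CHECK FAILED: Found {nulls} null values in column '{column}'")
            passed = False
    return passed

# Example: one record is missing an email, so the whole batch is rejected.
customers = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": None},
]
assert null_check(customers, ["customer_id", "email"]) is False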

2. Uniqueness Checks (Duplicate Detection)

Purpose: Ensures that primary keys are unique with no duplicates.

What it checks:

  • Primary key values must be unique across all records
  • Composite keys (multiple columns) must form unique combinations

Example scenarios:

Scenario                                      | Result
100 users, 100 unique user IDs                | ✅ PASS
100 users, 95 unique user IDs (5 duplicates)  | ❌ FAIL
Composite key (user_id + order_id) all unique | ✅ PASS
Same order ID appears twice for same user     | ❌ FAIL

When this check fails:

  • You'll see: "UNIQUENESS CHECK FAILED: Found X duplicate rows"
  • Common causes: Source system bugs, race conditions, duplicate API responses
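
Conceptually, the uniqueness check counts how many rows repeat a primary-key combination that has already been seen. The sketch below is illustrative only, in plain Python rather than the system's actual implementation:

from collections import Counter

def uniqueness_check(records, key_columns):
    """Fail the batch if any primary-key combination appears more than once."""
    keys = [tuple(row[col] for col in key_columns) for row in records]
    duplicates = sum(count - 1 for count in Counter(keys).values() if count > 1)
    if duplicates:
        print(f"UNIQUENESS CHECK FAILED: Found {duplicates} duplicate rows")
    return duplicates == 0

# Example: a composite key (user_id, order_id) with one repeated pair fails the check.
orders = [
    {"user_id": 1, "order_id": "A"},
    {"user_id": 1, "order_id": "A"},   # same order appears twice for the same user
    {"user_id": 2, "order_id": "B"},
]
assert uniqueness_check(orders, ["user_id", "order_id"]) is False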

3. Freshness Checks (Data Recency Validation)

Purpose: Ensures that data is recent and timely based on your pipeline schedule.

What it checks:

  • Data age compared to your pipeline frequency
  • Whether new data has arrived since the last run

Freshness Thresholds (automatically set based on your schedule):

Your Pipeline Runs | Warning After | Error (Reject) After
Every hour         | 2 hours       | 6 hours
Daily              | 25 hours      | 48 hours
Weekly             | 8 days        | 14 days
Monthly            | 32 days       | 62 days

Example scenarios:

Pipeline Schedule | Last Data Received | Result
Daily             | 1 hour ago         | ✅ PASS
Daily             | 30 hours ago       | ⚠️ WARNING (continues)
Daily             | 50 hours ago       | ❌ FAIL (rejects)
Hourly            | 7 hours ago        | ❌ FAIL (rejects)

When this check fails:

  • You'll see: "FRESHNESS CHECK FAILED: Data is X hours old (threshold: Y hours)"
  • Common causes: Source system delays, API downtime, network issues, upstream pipeline failures

Important: Warnings don't stop the pipeline - they're just notifications. Only errors stop the pipeline.
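
A simplified model of this logic, including the warning-versus-error distinction, is sketched below. The thresholds mirror the table above; how the real system derives them from your cron expression and obtains the newest record timestamp is glossed over here.

from datetime import datetime, timedelta, timezone

# Warn/error thresholds keyed by schedule, mirroring the table above.
FRESHNESS_THRESHOLDS = {
    "hourly":  (timedelta(hours=2),  timedelta(hours=6)),
    "daily":   (timedelta(hours=25), timedelta(hours=48)),
    "weekly":  (timedelta(days=8),   timedelta(days=14)),
    "monthly": (timedelta(days=32),  timedelta(days=62)),
}

def freshness_check(latest_record_time, schedule, now=None):
    """Return 'pass', 'warning', or 'fail' based on how old the newest record is."""
    now = now or datetime.now(timezone.utc)
    warn_after, error_after = FRESHNESS_THRESHOLDS[schedule]
    age = now - latest_record_time
    if age > error_after:
        print(f"FRESHNESS CHECK FAILED: Data is {age.total_seconds() / 3600:.0f} hours old "
              f"(threshold: {error_after.total_seconds() / 3600:.0f} hours)")
        return "fail"      # errors stop the pipeline
    if age > warn_after:
        return "warning"   # warnings are logged; the pipeline continues
    return "pass"

# Example: a daily pipeline whose newest record is 50 hours old is rejected.
now = datetime(2026, 1, 11, 12, 0, tzinfo=timezone.utc)
assert freshness_check(now - timedelta(hours=50), "daily", now=now) == "fail"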


Understanding Validation Results

Test Summary Report

After each pipeline run, you'll receive a test summary showing all validation results:

======================================================
DATA QUALITY TEST SUMMARY
======================================================
Pipeline: customer_data_import
Execution ID: exec-20260111-143022
Status: PASSED ✓
======================================================

Total Tests: 6
Passed: 6
Failed: 0
Warnings: 0

Execution Time: 2.3 seconds
======================================================

MAIN TABLE: customers
✓ null_check_customer_id PASSED
✓ null_check_email PASSED
✓ uniqueness_check PASSED
✓ freshness_check PASSED

RELATED TABLE: customer_orders
✓ null_check_order_id PASSED
✓ freshness_check PASSED

======================================================
RESULT: All data quality checks passed
DATA COMMITTED: 1,247 records written successfully
======================================================

Understanding Test States

Symbol | State   | Meaning                    | What Happens
✓      | PASSED  | Test passed successfully   | Pipeline continues normally
⚠️     | WARNING | Warning threshold exceeded | Pipeline continues, but you should investigate
❌     | FAILED  | Test failed                | Pipeline stops, data is rejected

Configuring Data Quality Tests

Enabling Data Quality Checks

When setting up your pipeline, specify which checks to enable:

Option 1: Enable All Checks (Recommended)

{
  "data_quality_tests": ["null", "unique", "freshness"]
}

Option 2: Enable Specific Checks

{
  "data_quality_tests": ["null", "unique"]
}

Option 3: Disable All Checks (Not Recommended)

{
  "data_quality_tests": []
}

⚠️ Warning: Disabling data quality tests means invalid or duplicate data may reach your destination.

What Each Test Requires

Test Type | Required Configuration                          | Checks
null      | Primary key fields must be defined              | All primary key columns for NULL values
unique    | Primary key fields must be defined              | Primary key uniqueness across all records
freshness | Pipeline schedule (cron expression) must be set | Data age against automatic thresholds

Example Configuration

{
  "pipeline_name": "customer_data_sync",
  "table_name": "customers",
  "dedupe_keys_json_paths": ["customer_id", "email"],
  "cron": "0 0 * * *",
  "data_quality_tests": ["null", "unique", "freshness"]
}

This configuration will:

  • Check that customer_id and email are never NULL
  • Check that the combination of customer_id and email is unique
  • Check that data is less than 48 hours old (based on daily schedule)
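
To make that mapping concrete, here is one hypothetical reading of the configuration in Python. The interpretation follows the tables above; the variable names and wiring are illustrative, not the system's internals:

config = {
    "pipeline_name": "customer_data_sync",
    "table_name": "customers",
    "dedupe_keys_json_paths": ["customer_id", "email"],
    "cron": "0 0 * * *",                              # midnight every day -> daily schedule
    "data_quality_tests": ["null", "unique", "freshness"],
}

primary_keys = config["dedupe_keys_json_paths"]       # drive the NULL and uniqueness checks
enabled_tests = set(config["data_quality_tests"])     # only these checks run at Stage 3

if "freshness" in enabled_tests:
    # A daily cron maps to the daily thresholds from the table above.
    warn_after_hours, error_after_hours = 25, 48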

What Happens When Validation Fails

Immediate Actions

When any validation check fails:

  1. Pipeline Stops Immediately

    • No further processing occurs
    • Data is not written to any destination
    • All temporary data is preserved for investigation
  2. Detailed Error Logging

    • Exact failure reason is recorded
    • Number of affected records is logged
    • Timestamp of failure is captured
  3. Notification

    • Pipeline status is set to FAILED
    • Error details are available in logs
    • Monitoring systems are alerted
  4. Data Preservation

    • Failed data remains in temporary staging area
    • You can review the problematic records
    • Source data is not affected
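
Putting those four behaviors together, the failure path can be pictured roughly as below. The function name, arguments, and log format are assumptions for illustration; the real system records the same kind of detail in its own logs:

import logging

logger = logging.getLogger("pipeline")

def handle_validation_failure(execution_id, failed_checks, staging_path):
    """Stop processing, log the details, and keep the staged data for investigation."""
    logger.error("Execution %s failed validation: %s", execution_id, failed_checks)
    logger.error("Staged data preserved at %s; nothing was written to the destination",
                 staging_path)
    # The staging area is intentionally NOT cleaned up, so the problematic
    # records can be reviewed before the pipeline is re-run.
    return "FAILED"  # status picked up by monitoring and notifications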

Example Failure Scenarios

Scenario 1: NULL Check Failure

What Happened:

NULL CHECK FAILED: Found 15 null values in column 'email'

What This Means:

  • 15 customer records are missing email addresses
  • Data quality requirement: all customers must have emails
  • Pipeline rejected all 1,000 records (not just the 15 bad ones)

What You Should Do:

  1. Check source system for missing emails
  2. Verify API is returning complete data
  3. Fix data quality at source
  4. Re-run pipeline once source is corrected

Scenario 2: Uniqueness Check Failure

What Happened:

UNIQUENESS CHECK FAILED: Found 8 duplicate rows based on primary key 'customer_id'

What This Means:

  • 8 customer IDs appear more than once in the incoming data
  • This could indicate duplicate records or improper deduplication
  • Pipeline rejected all records to prevent corruption

What You Should Do:

  1. Investigate why source has duplicates
  2. Check if source API is being called multiple times
  3. Verify deduplication logic is correctly configured
  4. Fix source or adjust deduplication strategy

Scenario 3: Freshness Check Failure

What Happened:

FRESHNESS CHECK FAILED: Data is 52 hours old (threshold: 48 hours for daily pipeline)

What This Means:

  • No new data has arrived in over 2 days
  • Your daily pipeline expects daily data
  • Pipeline rejected the stale data

What You Should Do:

  1. Check if source system is operational
  2. Verify source system is generating new data
  3. Check for upstream pipeline failures
  4. Investigate network connectivity issues
  5. Consider if this is expected (e.g., weekend, holiday)

Data Recovery After Failures

Option 1: Fix and Re-run (Recommended)

  1. Identify and fix the root cause
  2. Re-run the pipeline
  3. Fresh data will be extracted and validated
  4. If validation passes, data will be committed

Option 2: Adjust Configuration

  1. If validation rules are too strict, adjust thresholds
  2. Update pipeline configuration
  3. Re-run pipeline with new settings

Option 3: Temporary Bypass (Use with Caution)

  1. Temporarily disable specific checks
  2. Re-run pipeline
  3. Re-enable checks after resolution

⚠️ Important: Bypassing validation checks should only be done when you're certain the data quality issue is acceptable or has been manually verified.


Best Practices

1. Always Enable Data Quality Tests in Production

Do This:

  • Enable all three test types for critical data
  • Use NULL and uniqueness checks at minimum
  • Monitor test results regularly

Don't Do This:

  • Disable validation to "make the pipeline work"
  • Ignore warning messages
  • Assume source data is always perfect

2. Set Realistic Freshness Thresholds

Your pipeline schedule should match your data frequency:

Data Updates        | Recommended Schedule | Why
Real-time streaming | Every 15 minutes     | Catch delays quickly
Daily batch exports | Daily                | Match data availability
Weekly reports      | Weekly               | Appropriate for reporting cycles
Monthly archives    | Monthly              | Match archival schedule

3. Monitor Validation Metrics

Track these metrics over time:

  • Number of validation failures per month
  • Most common failure types
  • Time to resolve validation issues
  • Data quality improvement trends

4. Document Expected Failures

Some failures may be expected:

  • Weekend gaps for business-day-only data
  • Holiday periods with no activity
  • Scheduled maintenance windows

Document these patterns to avoid false alarms.


5. Set Up Alerting

Configure alerts for:

  • ❌ Any validation failure (high priority)
  • ⚠️ Repeated warnings (medium priority)
  • ℹ️ Unusual data volumes (low priority)

Troubleshooting Common Issues

Problem: NULL Check Keeps Failing

Common Causes:

  1. Source system has incomplete data
  2. API authentication is partially failing
  3. Database view/query is missing joins
  4. Field mappings are incorrect

How to Diagnose:

  1. Check recent changes to source system
  2. Verify API credentials are valid
  3. Test source query independently
  4. Review field mapping configuration

Resolution:

  • Fix data quality at source (preferred)
  • Or adjust which fields are required in configuration

Problem: Uniqueness Check Failing with Duplicates

Common Causes:

  1. Source has legitimate duplicates
  2. Incorrect primary key definition
  3. Race conditions in data extraction
  4. Time-based data needs timestamp in key

How to Diagnose:

  1. Query source system directly for duplicates
  2. Review primary key definition - is it truly unique?
  3. Check if multiple pipeline runs are happening simultaneously
  4. Verify if records change over time (needs versioning)

Resolution:

  • Fix duplicates at source
  • Or add timestamp/version to primary key
  • Or use different deduplication strategy

Problem: Freshness Check Failing Repeatedly

Common Causes:

  1. Source system is not producing new data
  2. Pipeline schedule doesn't match data frequency
  3. Upstream data pipeline is failing
  4. Time zone mismatches

How to Diagnose:

  1. Verify source system is operational
  2. Check when last data was actually created (not extracted)
  3. Review upstream pipeline status
  4. Compare timestamps in different time zones

Resolution:

  • Fix upstream data generation issues
  • Or adjust pipeline schedule to match actual data frequency
  • Or correct time zone configurations
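
The time-zone pitfall in particular is easy to reproduce: comparing a naive local timestamp against UTC can make data look several hours older (or newer) than it really is. A small illustration with made-up times:

from datetime import datetime, timedelta, timezone

# The newest record was created at 09:00 in UTC-5, which is 14:00 UTC.
record_time = datetime(2026, 1, 11, 9, 0, tzinfo=timezone(timedelta(hours=-5)))
now_utc = datetime(2026, 1, 11, 16, 0, tzinfo=timezone.utc)

correct_age = now_utc - record_time  # 2 hours: both timestamps are timezone-aware
misleading_age = now_utc.replace(tzinfo=None) - record_time.replace(tzinfo=None)  # 7 hours

print(correct_age, misleading_age)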

Problem: Warning Messages Keep Appearing

What Warnings Mean:

  • Data quality is borderline but acceptable
  • Not severe enough to fail, but needs attention
  • Trend may lead to failures soon

What To Do:

  1. Don't ignore warnings - investigate root cause
  2. Monitor if warnings are becoming more frequent
  3. Address issues before they become failures
  4. Consider if thresholds need adjustment

Problem: All Tests Pass But Data Looks Wrong

Possible Issues:

  • Tests are configured but too lenient
  • Missing additional validation rules
  • Business logic issues not caught by quality checks

What To Do:

  1. Review test configuration
  2. Add more specific validation rules
  3. Consider custom business rule checks
  4. Implement downstream data validation
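
When the built-in checks pass but the data still looks wrong, a custom business-rule check downstream can cover what the NULL, uniqueness, and freshness tests cannot. The rule below is purely hypothetical (the field names are invented) and only shows the shape such a check might take:

def business_rule_check(records):
    """Hypothetical rule: order totals must be non-negative and statuses must be known."""
    allowed_statuses = {"pending", "shipped", "cancelled"}
    violations = [
        row for row in records
        if row.get("order_total", 0) < 0 or row.get("status") not in allowed_statuses
    ]
    if violations:
        print(f"BUSINESS RULE CHECK FAILED: {len(violations)} records violate business rules")
    return not violations

# Example: a negative order total passes NULL/uniqueness/freshness but fails here.
assert business_rule_check([{"order_total": -10, "status": "shipped"}]) is False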

Getting Help

If you encounter validation issues you can't resolve:

  1. Check the logs: Full error details are always logged
  2. Review test summary: Shows exactly which test failed and why
  3. Examine sample data: Look at the actual records that failed
  4. Check documentation: Review configuration requirements
  5. Contact support: Provide execution ID and error details

Information to Provide When Reporting Issues

  • Pipeline name and configuration
  • Execution ID (from test summary)
  • Complete error message
  • When the issue started
  • Any recent changes to pipeline or source system
  • Sample of affected data (anonymized if sensitive)

Summary

Key Takeaways

✅ Data is validated before being written to destinations
✅ Failed validation = zero data committed (all-or-nothing)
✅ Three check types: NULL, Uniqueness, Freshness
✅ Configure tests based on your data quality requirements
✅ Monitor trends and address warnings proactively
✅ Fix issues at source whenever possible

The Validation Promise

When you see "Pipeline Status: PASSED" with all validation checks marked ✓, you can be confident that:

  • All required fields are populated
  • All primary keys are unique
  • All data is fresh and timely
  • All records meet quality standards
  • All data has been safely committed
  • All data is available in your destination

This guarantee enables you to trust your data and make confident business decisions.


Additional Resources

  • Pipeline Configuration Guide: How to set up and configure pipelines
  • Monitoring Dashboard: View real-time validation results
  • API Documentation: Programmatic access to validation results
  • Support Portal: Get help with validation issues